Shared pipeline architecture for motion vector prediction and residual decoding

ABSTRACT

A shared pipeline architecture is provided for H.264 motion vector prediction and residual decoding, and intra prediction for CABAC and CAVLC entropy in Main Profile and High Profile for standard and high definition applications. All motion vector predictions and residual decoding of I-type, P-type, and B-type pictures are completed through the shared pipeline. The architecture enables better performance and uses less memory than conventional architectures. The architecture can be completely implemented in hardware as a system-on-chip or chip set using, for example, field programmable gate array (FPGA) technology or application specific integrated circuitry (ASIC) or other custom-built logic.

RELATED APPLICATIONS

This application is a continuation of U.S. Ser. No. 11/138,846, filed May 25, 2005, titled “Shared Pipeline Architecture For Motion Vector Prediction And Residual Decoding”, published as U.S. Patent Application Publication 2006/0126740A1 on Jun. 15, 2006, which claims the benefit of U.S. Provisional Application No. 60/635,114, filed on Dec. 10, 2004. In addition, this application is related to U.S. Ser. No. 11/137,971, filed May 25, 2005, titled “Digital Signal Processing Structure for Decoding Multiple Video Standards”, published as U.S. Patent Application Publication 2006/0126740A1 on Jun. 15, 2006. Each of these applications is herein incorporated in its entirety by reference.

FIELD OF THE INVENTION

The invention relates to video processing, and more particularly, to a shared pipeline architecture for carrying out the decoding process of CABAC and CAVLC entropy of H.264 Main Profile and High Profile.

BACKGROUND OF THE INVENTION

The H.264 specification, also known as the Advanced Video Coding (AVC) standard, is a high compression digital video codec standard produced by the Joint Video Team (JVT), and is identical to ISO MPEG-4 part 10. The H.264 standard is herein incorporated by reference in its entirety.

H.264 CODECs can encode video with approximately three times fewer bits than comparable MPEG-2 encoders. This significant increase in coding efficiency (e.g., good video quality at bit rates below 2 Mbps) means that more quality video data can be sent over the available channel bandwidth. In addition, video services can now be offered in environments where they previously were not possible. H.264 CODECs would be particularly useful, for instance, in high definition television (HDTV) applications, bandwidth limited networks (e.g., streaming mobile television), personal video recorder (PVR) and storage applications for home use, and other such video delivery applications (e.g., digital terrestrial TV, cable TV, satellite TV, video over xDSL, DVD, and digital and wireless cinema).

In general, all standard video processing (e.g., MPEG-2 or H.264) encodes video as a series of pictures. For video in the interlaced format, the two fields of a frame can be encoded together as a frame picture, or encoded separately as two field pictures. Note that both types of encoding can be used in a single interlaced sequence. The output of the decoding process for an interlaced sequence is a series of reconstructed fields. For video in the progressive format, all encoded pictures are frame pictures. Here, the output of the decoding process is a series of reconstructed frames.

Encoded pictures are classified into three types: I, P, and B. I-type pictures represent intra coded pictures, and are used as a prediction starting point (e.g., after error recovery or a channel change). Here, all macro blocks are coded without prediction. P-type pictures represent predicted pictures. Here, macro blocks can be coded with forward prediction with reference to previous I-type and P-type pictures, or they can be intra coded (no prediction). B-type pictures represent bi-directionally predicted pictures. Here, macro blocks can be coded with forward prediction (with reference to previous I-type and P-type pictures), or with backward prediction (with reference to next I-type and P-type pictures), or with interpolated prediction (with reference to previous and next I-type and P-type pictures), or intra coded (no prediction). Note that in P-type and B-type pictures, macro blocks may be skipped and not sent at all. In such cases, the decoder uses the anchor reference pictures for prediction with no error.

The advanced coding techniques of the H.264 specification operate within a similar scheme as used by previous MPEG standards. The higher coding efficiency and video quality are enabled by a number of features, including improved motion estimation and inter prediction, spatial intra prediction and transform, and context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable length coding (CAVLC) algorithms.

As is known, motion estimation is used to support inter picture prediction for eliminating temporal redundancies. Spatial correlation of data is used to provide intra picture prediction (prior to the transform). Residuals are constructed as the difference between predicted images and the source images. Discrete spatial transform and filtering is used to eliminate spatial redundancies in the residuals. H.264 also supports entropy coding of the transformed residual coefficients and of the supporting data such as motion vectors.

Entropy is a measure of the average information content per source output unit, and is typically expressed in bits/pixel. Entropy is maximized when all possible values of the source output unit are equally likely (e.g., an image of 8-bit pixels with an average information content of 8 bits/pixel). Coding the source output unit with fewer bits than its entropy, on average, generally results in information loss. Note, however, that the entropy can be reduced so that the image can be coded with fewer than 8 bits/pixel on average without information loss.
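By way of a non-limiting illustration, the entropy of a source can be estimated from the relative frequency of its symbols. The following Python sketch is not taken from the H.264 specification; the function name entropy_bits_per_symbol and the sample data are assumptions chosen only for this example.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(samples):
    """Estimate Shannon entropy (bits per symbol) from observed frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical 8-bit sources: one uniform, one dominated by a single value.
uniform_pixels = list(range(256))          # every value equally likely
skewed_pixels = [0] * 192 + [255] * 64     # one value three times as likely

uniform_entropy = entropy_bits_per_symbol(uniform_pixels)
skewed_entropy = entropy_bits_per_symbol(skewed_pixels)
```

In this sketch the uniform source reaches the 8 bits/pixel maximum, while the skewed source has an entropy below 1 bit/pixel, which is the situation an entropy coder can exploit.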

The H.264 specification provides two alternative processes of entropy coding: CABAC and CAVLC. CABAC provides a highly efficient encoding scheme when it is known that certain symbols are much more likely than others. Such dominant symbols may be encoded with extremely small bit/symbol ratios. CABAC continually updates frequency statistics of the incoming data, and adaptively adjusts the coding algorithm in real-time. CAVLC uses multiple variable length codeword tables to encode transform coefficients. The best codeword table is selected adaptively based on a priori statistics of already processed data. A single table is used for non-coefficient data.

The H.264 specification provides for seven profiles each targeted to particular applications, including a Baseline Profile, a Main Profile, an Extended Profile, and four High Profiles. The Baseline Profile supports progressive video, uses I and P slices, CAVLC for entropy coding, and is targeted towards real-time encoding and decoding for CE devices. The Main Profile supports both interlaced and progressive video with macro block or picture level field/frame mode selection, and uses I, P, B slices, weighted prediction, as well as both CABAC and CAVLC for entropy coding. The Extended Profile supports both interlaced and progressive video, CAVLC, and uses I, P, B, SP, SI slices.

The High Profile extends functionality of the Main Profile for effective coding. The High Profile uses adaptive 8×8 or 4×4 transform, and enables perceptual quantization matrices. The High 10 Profile is an extension of the High Profile for 10-bit component resolution. The High 4:2:2 Profile supports 4:2:2 chroma format and up to 10-bit component resolution (e.g., for video production and editing). The High 4:4:4 Profile supports 4:4:4 chroma format and up to 12-bit component resolution. It also enables lossless mode of operation and direct coding of the RGB signal (e.g., for professional production and graphics).

Given that the H.264 standard is relatively new, there is currently a limited selection of available H.264 coding architectures. What is needed, therefore, are coding architectures that are H.264 enabled.

SUMMARY OF THE INVENTION

One embodiment of the present invention provides a video decoding system configured with a multi-stage shared pipeline architecture for carrying out the H.264 CABAC and CAVLC entropy decoding processes in Main Profile and High Profile. The system includes a first stage for grouping macro block properties into 4×4 sub blocks. A second stage is for performing separation of coefficients and run level pairs. A third stage is for performing run level pair decoding. A fourth stage is for performing 4×4 block zigzag transform, DC/AC coefficient merging, and motion vector prediction. A fifth stage is for performing 8×8 transforms in the High Profile, and is skipped in the Main Profile.

The system may further include a line buffer used for motion vector prediction of inter prediction and intra prediction mode decoding. In one particular embodiment, the fourth stage has four split modes for the motion vector prediction: 16×16, 16×8, 8×16, and 8×8, and each split mode has its own motion vector computation logic, reference picture index computation, and frame field scaling. For the 8×8 split mode, there can further be four sub-split modes: 8×8, 8×4, 4×8, and 4×4. In another particular embodiment, B-type picture decoding has a dual channel read write structure, and logic index ID searching is performed on a block by block basis. In one such case, a block N+1 logic index ID is calculated while motion vector prediction is performed for block N, thereby mitigating searching time.

The first stage may include a residual decoder state machine for carrying out residual decoding of I-type, P-type, and B-type pictures. The first stage may include a PCM raw data state machine for carrying out a PCM raw data mode. The second stage may include a state machine for carrying out separation of coefficients and run level pairs. The third stage may include a state machine for carrying out run level pair decoding. The fourth and fifth stages may include a state machine for carrying out transform, DC/AC coefficient merging, and motion vector prediction.

The system may further include a memory interfaced to the first stage via an N-bit bus, wherein there are a plurality of syntax groups of N-bit data inside each macro block, and the memory provides the first stage with the length for every syntax group. In one such case, the first stage includes a memory control state machine that collects information about macro block properties and groups it into 4×4 sub blocks for all predictions. In another such case, N=24, and the N bits designate current macro block properties including at least one of: intra prediction, inter prediction, skip mode, raw data mode, macro block ID, slice ID, direct mode, macro block split mode, sub block split mode which can be down to 4×4 for luma and 2×2 for chroma and all corresponding intra prediction flags and motion vector differences in the X and Y directions, picture reference index for forward and backward predictions, CABAC bit map and level, and CAVLC run level information.

The system may further include a dual channel read memory interfaced to the first stage via an N-bit bus, wherein for frame pictures, channel 1 of the memory is for even macro block row reference reads and channel 2 of the memory is for odd macro block row reference reads, and for field pictures, channel 1 of the memory is for the top field picture and channel 2 of the memory is for the bottom field picture reference read. In another such embodiment, the system further includes a dual channel read memory interfaced to the first stage via an N-bit bus, wherein if the current decoding picture is a frame picture and a corresponding reference picture is a field picture, either top field or bottom field, then both channel 1 and channel 2 of the memory have the same reference picture, thereby facilitating B-type picture decoding. If the current decoding picture is a B-type picture, then the reference picture can be read from, for example, a DDR SDRAM for direct mode motion vector prediction. In one specific embodiment, the system includes a dual channel read memory interfaced to the first stage via a 128-bit bus, and for every macro block there are five data beats of 128 bits that contain macro block properties, logical ID and physical ID of every 8×8 block inside a 16×16 macro block, and motion vectors of every 4×4 block.

The system may further include a dual channel write memory interfaced to the fifth stage via an N-bit bus, wherein if the current decoding picture is a frame picture, then even rows are written out through channel 1 and odd rows are written out through channel 2. If the current decoding picture is a field picture, either top field or bottom field, then the top field is written out through channel 1 and the bottom field is written out through channel 2 of the memory, thereby facilitating B-type picture decoding. In one particular case, if the current decoding picture is a reference picture, then properties for each macro block are saved to a DDR SDRAM and read back when the current decoding picture is a B-type picture.

The system may further include a fractional interpolation block and in-line loop filter (FIB/ILF) memory interfaced to the fifth stage via a 128-bit bus, wherein for every macro block, there are 18 data beats of 128 bits that contain macro block properties and expanded motion vector information of luma and chroma for forward and backward reference. Also, a DSP macro block header/data memory may be interfaced with the fifth stage via the 128-bit bus, wherein for every 16×16 macro block, the DSP macro block header/data memory has 1 data beat of 128 bits for a header that includes information for macro block properties and 48 data beats of 128 bits for coefficients. Also, a dual channel write memory may be interfaced with the fifth stage via the 128-bit bus, wherein write channel 1 of the memory is for writing even row macro blocks of frame pictures and top fields of field pictures, while write channel 2 of the memory is for writing odd row macro blocks of frame pictures and bottom fields of field pictures. In one such configuration, each of the FIB/ILF memory, the DSP macro block header/data memory, and the dual channel write memory is implemented using a dual port SRAM interfaced with the fifth stage via the shared 128-bit bus. The system may further include a macro block quantized DCT coefficient memory that is interfaced to the fifth stage via a 128-bit bus, for storing dequantized DC/AC coefficients. This memory may also be, for example, a dual port SRAM. Note that single port SRAMs could also be used, assuming bandwidth requirements could be satisfied. Single port SRAM saves on chip space and power consumption, but may not be appropriate depending on the bandwidth requirements.

Another embodiment of the present invention provides a video decoding system configured with a multi-stage shared pipeline architecture for carrying out the H.264 CABAC and CAVLC entropy decoding processes in Main Profile and High Profile. This particular system includes a first stage including a memory control state machine that collects information about macro block properties and groups it into 4×4 sub blocks for all predictions. A second stage is for performing separation of coefficients and run level pairs. A third stage is for performing run level pair decoding. A fourth stage is for performing 4×4 block zigzag transform, DC/AC coefficient merging, and motion vector prediction. Here, the fourth stage has four split modes for the motion vector prediction: 16×16, 16×8, 8×16, and 8×8, and each split mode has its own motion vector computation logic, reference picture index computation, and frame field scaling. For the 8×8 split mode, there can further be four sub-split modes: 8×8, 8×4, 4×8, and 4×4. A fifth stage is for performing 8×8 transforms in the High Profile, and is skipped in the Main Profile. Dual read channels and dual write channels are provided for direct mode motion vector prediction of B-type pictures.

The system may further include a memory interfaced to the first stage via a 24-bit bus, wherein there are a plurality of syntax groups of 24-bit data inside each macro block, and the memory provides the first stage with the length for every syntax group. A fractional interpolation block and in-line loop filter (FIB/ILF) memory may be interfaced to the fifth stage via a 128-bit bus, wherein for every macro block, there are 18 data beats of 128 bits that contain macro block properties and expanded motion vector information of luma and chroma for forward and backward reference. A DSP macro block header/data memory may be interfaced to the fifth stage via the same 128-bit bus, wherein for every 16×16 macro block, the DSP macro block header/data memory has 1 data beat of 128 bits for a header that includes information for macro block properties and 48 data beats of 128 bits for coefficients. A macro block quantized DCT coefficient memory may be interfaced to the fifth stage via a 128-bit bus, for storing dequantized DC/AC coefficients.

The features and advantages described herein are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram of a video decoding system configured with a multi-stage pipeline architecture in accordance with one embodiment of the present invention.

FIG. 1B illustrates a motion vector prediction scheme configured in accordance with one embodiment of the present invention.

FIG. 2A illustrates an ISE memory control state machine configured in accordance with one embodiment of the present invention.

FIG. 2B illustrates a residual decoder state machine configured in accordance with one embodiment of the present invention.

FIG. 3A illustrates a back P-vector read state machine configured in accordance with one embodiment of the present invention.

FIG. 3B illustrates a back P-vector write state machine configured in accordance with one embodiment of the present invention.

FIG. 4 illustrates a PCM raw data state machine configured in accordance with one embodiment of the present invention.

FIG. 5 illustrates a syntax group math state machine configured to carry out separation of coefficients and run level pairs in accordance with one embodiment of the present invention.

FIG. 6 illustrates a syntax group math state machine configured to carry out run level pair decoding in accordance with one embodiment of the present invention.

FIG. 7 illustrates a state machine configured to carry out transform, coefficient merging and motion vector prediction in accordance with one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

A pipeline architecture is provided for H.264 motion vector prediction and residual decoding, and intra prediction for CABAC and CAVLC entropy in Main Profile and High Profile for standard and high definition applications. All motion vector predictions and residual decoding of I-type, P-type, and B-type pictures are completed through the shared pipeline. The architecture enables better performance and uses less memory than conventional architectures. The architecture can be completely implemented in hardware as a system-on-chip or chip set using, for example, field programmable gate array (FPGA) technology or application specific integrated circuitry (ASIC) or other custom-built logic.

In one particular embodiment, a FIFO (first in, first out) queue architecture in a first stage of the pipeline separates the block syntax group into 4×4 sub-block syntax groups, which are then fed into the next stage of the pipeline. Because operation is 4×4 sub-block based, the data format can be accessed more efficiently (better flow control) and the architecture design is more structured (for higher performance), relative to conventional architectures. A line buffer is shared by the Y, U, and V components of the video signal during the inter prediction and intra prediction decoding. Dual channels for read and write simplify the architecture for direct mode motion vector prediction of B-type pictures.

Multi-Stage Pipeline Architecture

FIG. 1A is a block diagram of a video decoding system configured with a multi-stage pipeline architecture in accordance with one embodiment of the present invention. This architecture can be used, for example, for carrying out the H.264 CABAC and CAVLC entropy decoding processes in Main Profile and High Profile for standard and high definition applications, and in particular, for carrying out motion vector prediction, residual run level pair decoding, intra prediction, and dual channel read write control for B-type picture decoding.

As can be seen, the system generally includes input memory, a macro block builder, and output memory. The input memory includes an ISE memory, a back P-vector input memory 1, and a back P-vector input memory 2. The macro block builder includes a five stage pipeline (S1 through S5) architecture and a line buffer. The output memory includes a fractional interpolation block and in-line loop filter (FIB/ILF) memory, a digital signal processor (DSP) macro block header/data memory, a back P-vector output memory 1, a back P-vector output memory 2, and a macro block quantized discrete cosine transform (QDCT) coefficient memory.

This architecture can be implemented, for example, as a system-on-chip (e.g., FPGA or ASIC technology, or other custom built integrated circuitry). The on-chip memories can be implemented using single or dual port SRAMs. For instance, the back P-vector input memory 1 and back P-vector input memory 2 can be implemented as one dual port SRAM. The line buffer can be implemented as a single port SRAM. The FIB/ILF memory, DSP macro block header/data memory, back P-vector output memory 1, and back P-vector output memory 2 can each be implemented as one of four sections of another dual port SRAM. Note that other on-chip and off-chip support elements and functionality, such as memory (e.g., DDR SDRAM) and FIB/ILF/DSP modules (for carrying out fractional interpolation, inline filtering, and digital signal processing functions), can also be provided as part of the overall system. A number of hardware implementations will be apparent in light of this disclosure.

As previously stated, there are five stages inside the macro block builder structure. The first stage (S1) mainly handles the interface for the syntax grouping, back P-vector retrieving and PCM raw data mode bypassing through the pipeline. In one embodiment, there is one syntax FIFO and one corresponding length property FIFO for this stage. The state machines associated with this stage are: ISE memory control state machine (FIG. 2A), residual decoder state machine (FIG. 2B), back P-vector read state machine (FIG. 3A), and PCM raw data state machine (FIG. 4). All the results are written into a FIFO between stages S1 and S2.

The second stage (S2) mainly performs the separation of run and level. Run indicates the number of zeroes between neighboring coefficients, and level is the value of the coefficient. This is executed by the separation of coefficient and run level pair state machine (FIG. 5). The results are written into a FIFO between stages S2 and S3.

The third stage (S3) decodes the run level pair. The coefficients are written into the proper position of a coefficient register according to the run information. This is controlled by the run level pair decoding state machine (FIG. 6), and the results are saved into a FIFO between stages S3 and S4.
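For illustration only, the placement of coefficients according to run information can be sketched in software as follows. This is a simplified model, not the hardware state machine of FIG. 6: the function name decode_run_level_pairs is hypothetical, and the sketch assumes each run counts the zeroes that precede its level in forward scan order within a 16-entry coefficient register.

```python
def decode_run_level_pairs(pairs, block_size=16):
    """Expand (run, level) pairs into a coefficient register.

    Assumption for this sketch: `run` is the number of zero coefficients
    preceding `level` in forward scan order.
    """
    coeffs = [0] * block_size
    pos = 0
    for run, level in pairs:
        pos += run                      # skip over the zero-valued positions
        if pos >= block_size:
            raise ValueError("run level pairs overflow the coefficient register")
        coeffs[pos] = level             # write the level at its proper position
        pos += 1
    return coeffs


# Hypothetical input: three non-zero coefficients in one 4x4 sub block.
example = decode_run_level_pairs([(0, 7), (2, -3), (1, 1)])
```

Positions not addressed by any pair remain zero, so only the non-zero levels and their runs need to pass through the FIFO between stages.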

The fourth stage (S4) is the main stage that controls the 4×4 sub block count and matches the coefficients with the motion vector decoding that is performed on a 4×4 basis. The intra mode prediction is also calculated here and shares the same line buffer entry. For progressive frames, there are four entries of 80 bits for each of the macro block properties. For adaptive frames, there are eight entries of 80 bits for each macro block pair. A field/frame zigzag scan will be handled here also. All the results are written out to a memory (e.g., dual port SRAM shared by ILF/FIB data elements, DSP data elements, and Back P-vector data elements). This is controlled by the transform, coefficient merging and motion vector prediction state machine and the back P-vector write state machine (FIG. 7).
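As a non-limiting software sketch of the 4×4 zigzag handling, the following Python fragment maps sixteen coefficients received in scan order back to raster positions. The table is the conventional 4×4 frame zigzag order; a field scan would use a different table, which is not shown. The names ZIGZAG_4X4 and inverse_zigzag_4x4 are assumptions introduced for this example.

```python
# Raster index (row * 4 + column) visited at each scan position
# for the conventional 4x4 frame zigzag order.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def inverse_zigzag_4x4(scanned):
    """Place 16 scan-ordered coefficients into a 4x4 block (list of rows)."""
    if len(scanned) != 16:
        raise ValueError("expected exactly 16 coefficients")
    block = [[0] * 4 for _ in range(4)]
    for scan_pos, raster in enumerate(ZIGZAG_4X4):
        block[raster // 4][raster % 4] = scanned[scan_pos]
    return block


# Feeding the scan positions themselves shows where each one lands.
layout = inverse_zigzag_4x4(list(range(16)))
```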

The fifth stage (S5) mainly handles High Profile for CABAC and CAVLC modes. A zigzag scan of 8×8 for both frame and field is executed here, with the results written out to a memory similar to the fourth stage S4 (e.g., dual port SRAM). Note that this stage is skipped in the Main Profile. This is controlled by the transform, coefficient merging and motion vector prediction state machine and the back P-vector write state machine (FIG. 7).

In each pipeline stage, note that there are one or more state machines associated with flow control. Each of the state machines will be discussed in reference to FIGS. 2A through 7.

In one particular embodiment, stage S1 of the pipeline is an input FIFO queue architecture that is configured for eight entries. Each entry has the associated macro block properties of length, zero residual, and total coefficients. Even though each FIFO entry has a maximum number of sub-entries (e.g., seventeen), the length property indicates how many entries are valid.

The motion vector prediction may have a number of split modes. In one such embodiment, there are four split modes for the motion vector prediction: 16×16, 16×8, 8×16, and 8×8. Each split mode has its own motion vector computation logic, reference picture index computation, and frame field scaling. For the 8×8 split mode, there can also be four sub-split modes: 8×8, 8×4, 4×8, and 4×4. The B-type picture decoding has a dual channel read write structure, and the logic index ID searching is performed on a block by block basis. The block 1 logic index ID can be calculated while motion vector prediction is performed for block 0. Likewise, the block 2 logic index ID can be calculated while motion vector prediction is performed for block 1. Likewise, the block 3 logic index ID can be calculated while motion vector prediction is performed for block 2. Thus, block N+1 logic index ID can be calculated while motion vector prediction is performed for block N. This effectively hides or mitigates the searching time, and uses simple logic without the need for content addressable memory macros. In addition, for B-type picture decoding, the reference picture can be read from the dual read channel structure. If the current macro block is direct mode, then the reference picture can be used to decode the current macro block. If the current macro block is not direct mode, the reference picture is discarded. In this case, the B-type picture decoding will be similar to P-type picture decoding. If the current picture is a reference picture, then the macro block information can be written out to a buffer (e.g., dual port SRAM) via the dual write channel structure.
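The split modes and the overlapped index search described above can be modeled in software, purely as an illustrative sketch. The function names partition_rects and overlapped_schedule are hypothetical, and the schedule assumes, for simplicity, that one logic index ID search and one motion vector prediction each occupy a single step.

```python
def partition_rects(part_w, part_h, parent_w=16, parent_h=16):
    """List (x, y, width, height) for each partition of a parent block."""
    return [(x, y, part_w, part_h)
            for y in range(0, parent_h, part_h)
            for x in range(0, parent_w, part_w)]


def overlapped_schedule(num_blocks):
    """Return per-step (search_block, predict_block) pairs.

    The index search for block N+1 shares a step with the motion vector
    prediction for block N; None marks an idle unit.
    """
    steps = []
    for step in range(num_blocks + 1):
        search = step if step < num_blocks else None
        predict = step - 1 if step >= 1 else None
        steps.append((search, predict))
    return steps


mode_16x8 = partition_rects(16, 8)               # two 16x8 partitions
sub_4x4 = partition_rects(4, 4, 8, 8)            # four 4x4 sub-splits of an 8x8
schedule = overlapped_schedule(4)                # four 8x8 blocks, 0 through 3
```

Under these assumptions four blocks complete in five steps rather than eight, which is one way to picture how the searching time is hidden behind the prediction work.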

In the example shown, the ISE syntax memory interfaces to the ISE memory control state machine (FIG. 2A) of stage S1 of the pipeline via a 24-bit bus. The ISE syntax memory indicates underflow when there is not enough data for one macro block decoding. There are several syntax groups of 24-bit data inside each macro block, and the ISE syntax memory provides the ISE memory control state machine of stage S1 with the length for every syntax group. Each of the 24 bits has an assigned meaning according to a known internal specification.

For example, the 24 bits collectively include the properties of the current macro block, such as: intra prediction, inter prediction, skip mode, raw data mode, macro block ID, slice ID, direct mode, macro block split mode, sub block split mode (which could be down to 4×4 for luma and 2×2 for chroma), all the corresponding intra prediction flags and motion vector differences in the X and Y directions, picture reference index for forward and backward predictions, CABAC bit map and level, and CAVLC run level information. The ISE memory control state machine of stage S1 collects all such information and groups it into 4×4 sub blocks for all predictions.

The ISE memory control state machine (FIG. 2A), residual decoder state machine (FIG. 2B), and the PCM raw data state machine (FIG. 4) can be used, for example, to control this interface between the ISE memory and stage S1 of the pipeline architecture.

The back P-vector input memory 1 (read channel 1) and the back P-vector input memory 2 (read channel 2) interface to stage S1 of the pipeline via a 128-bit bus in this example embodiment. There are underflow1 and underflow2 indicators for each read channel memory. For frame pictures, channel 1 is for even macro block row reference read and channel 2 is for odd macro block row reference read. For field pictures, channel 1 is for the top field picture and channel 2 is for the bottom field picture reference read. If the current decoding picture is a frame picture and the reference picture is a field picture, either top field or bottom field, then both channels will have the same reference picture. This is a special case arrangement to facilitate the B-type picture decoding. For every macro block, there are five data beats of 128 bits that contain macro block properties, logical ID and physical ID of every 8×8 block inside the 16×16 macro block, and motion vectors of every 4×4 block.
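The channel assignment just described can be summarized, as a hedged software sketch only, by the following Python fragment. The function names select_channel and channels_share_reference, and the string values used for picture structure and field parity, are assumptions made for this example and do not appear in the hardware description.

```python
def select_channel(picture_structure, mb_row=0, field="top"):
    """Return the channel number (1 or 2) for a reference read.

    Frame pictures: even macro block rows use channel 1, odd rows channel 2.
    Field pictures: top field uses channel 1, bottom field channel 2.
    """
    if picture_structure == "frame":
        return 1 if mb_row % 2 == 0 else 2
    if picture_structure == "field":
        if field not in ("top", "bottom"):
            raise ValueError("field must be 'top' or 'bottom'")
        return 1 if field == "top" else 2
    raise ValueError("picture_structure must be 'frame' or 'field'")


def channels_share_reference(current_structure, reference_structure):
    """Special case: frame picture decoded against a field reference picture."""
    return current_structure == "frame" and reference_structure == "field"
```

For example, select_channel("frame", mb_row=3) selects channel 2, and channels_share_reference("frame", "field") is true, reflecting the case in which both channels carry the same reference picture.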

In one embodiment, the back P-vector input memory 1 and back P-vector input memory 2 are implemented with a dual port SRAM. If the current decoding picture is a B-type picture, then the reference picture can be read back, for example, from an off-chip double data rate synchronous DRAM (DDR SDRAM) for direct mode motion vector prediction. Other memory devices can be used here as well, with factors such as power consumption and access speed determining the type of memory selected. The back P-vector read state machine (FIG. 3A) can be used, for example, to control the reference picture retrieving.

The line buffer is used mainly for motion vector prediction of inter prediction and intra prediction mode decoding, and in this example embodiment, interfaces to stage S4 of the pipeline via an 80-bit bus. In one particular embodiment, the line buffer holds one row of macro block properties for one line of pixels up to 1920 pixels for progressive frames, but holds two rows of macro block properties for two lines of pixels for adaptive frames. Such a configuration allows for high definition, but other configurations can be used here as well (e.g., standard definition where each row is 720 pixels). Note that the line buffer can be implemented with conventional technology. In one embodiment, the line buffer is implemented with a single port SRAM that is shared by the Y, U, and V components of the video signal during the inter prediction and intra prediction decoding.
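A rough capacity estimate for such a line buffer can be sketched as follows. This is an interpretation rather than a stated design figure: it assumes 16-pixel-wide macro blocks, and it assumes that the four 80-bit entries per macro block (progressive) and eight 80-bit entries per macro block pair (adaptive) mentioned for stage S4 are what the line buffer stores per macro block column. The function name line_buffer_bits is hypothetical.

```python
MB_WIDTH_PIXELS = 16   # width of one macro block in pixels
ENTRY_BITS = 80        # width of one line buffer entry


def line_buffer_bits(picture_width, adaptive=False):
    """Estimate line buffer size in bits for one picture width.

    Assumes 4 entries per macro block for progressive frames and
    8 entries per macro block pair for adaptive frames.
    """
    mbs_per_row = -(-picture_width // MB_WIDTH_PIXELS)   # ceiling division
    entries_per_column = 8 if adaptive else 4
    return mbs_per_row * entries_per_column * ENTRY_BITS


hd_progressive_bits = line_buffer_bits(1920)            # 120 macro blocks wide
hd_adaptive_bits = line_buffer_bits(1920, adaptive=True)
sd_progressive_bits = line_buffer_bits(720)             # 45 macro blocks wide
```

Under these assumptions a 1920-pixel progressive row needs 38,400 bits (4,800 bytes), and the adaptive case doubles that figure.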

The fractional interpolation block and in-line loop filter (FIB/ILF) memory interfaces to stage S5 of the pipeline via a 128-bit bus. In one particular embodiment, for every macro block, there are 18 data beats of 128 bits that contain the macro block properties and expanded motion vector information of luma and chroma for forward and backward reference. For every 16×16 macro block, the DSP macro block header/data memory has forty-nine data beats of 128 bits. In particular, there is one data beat for the header and forty-eight data beats for the coefficients. Note that the DSP macro block header/data memory shares the 128-bit bus (along with the FIB/ILF and the back P-vector memories 1 and 2) to interface with stage S5 of the pipeline. The header includes information for the macro block properties.

Transform coefficients are stored in the macro block quantized DCT coefficient memory, which is also interfaced with stage S5 of the pipeline via a 128-bit bus in this embodiment. In particular, the coefficient memory saves all the data elements of 49 data beats of 128 bits (one beat for the macro block header and 48 beats for QDCT coefficients).
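The data beat counts given above translate into per-macro-block storage as sketched below. The byte totals follow directly from the stated beat counts; the coefficient width, however, is only an inference that assumes 4:2:0 sampling with 384 coefficients per macro block (256 luma plus 128 chroma), and the names used are hypothetical.

```python
BEAT_BITS = 128  # width of one data beat on the shared bus

# Beats per macro block as stated for each memory.
BEATS_PER_MB = {"back_p_vector": 5, "fib_ilf": 18, "dsp_header_data": 49}


def bytes_per_macro_block(beats):
    """Convert a number of 128-bit data beats into bytes."""
    return beats * BEAT_BITS // 8


def implied_bits_per_coefficient(coeff_beats=48, coeffs_per_mb=384):
    """Bits available per coefficient, assuming 384 coefficients (4:2:0)."""
    total_bits = coeff_beats * BEAT_BITS
    if total_bits % coeffs_per_mb:
        raise ValueError("coefficients do not divide the beat payload evenly")
    return total_bits // coeffs_per_mb


sizes = {name: bytes_per_macro_block(n) for name, n in BEATS_PER_MB.items()}
coeff_bits = implied_bits_per_coefficient()
```

With these assumptions the 48 coefficient beats provide exactly 16 bits per coefficient, and the header plus coefficients occupy 784 bytes per macro block.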

The back P-vector output memory 1 (write channel 1) and back P-vector output memory 2 (write channel 2) share the 128-bit bus along with the FIB/ILF and the DSP macro block header/data memory to interface with stage S5 of the pipeline. These two write channels can be configured the same as the back P-vector input memory 1 and back P-vector input memory 2 read channels. If the current decoding picture is a reference picture, then all the properties for each macro block can be saved, for example, to off-chip DDR SDRAM or other memory and read back into the chip (assuming a system-on-chip implementation) when the current decoding picture is a B-type picture. Unlike the read channels 1 and 2, as long as the current picture is marked as a reference picture (which could be I-type, P-type, or B-type picture), the picture can be saved to off-chip DDR SDRAM through these two write channels. Write channel 1 is for writing even row macro blocks of frame pictures and the top fields of field pictures, while write channel 2 is for writing odd row macro blocks of frame pictures and bottom fields of field pictures.
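The channel assignment described above can be sketched as a small selection function. This is only an illustration of the stated rule; the function name and its parameters are hypothetical and do not appear in the described embodiment.

```python
def select_write_channel(is_field_picture, mb_row=0, is_bottom_field=False):
    """Pick back P-vector write channel 1 or 2 for the current macro block.

    Field pictures: top field -> channel 1, bottom field -> channel 2.
    Frame pictures: even macro block row -> channel 1, odd row -> channel 2.
    """
    if is_field_picture:
        return 2 if is_bottom_field else 1
    return 1 if mb_row % 2 == 0 else 2


if __name__ == "__main__":
    for row in range(4):
        print("frame picture, row", row, "-> channel",
              select_write_channel(False, mb_row=row))
```

The same rule applies to the two read channels, which the text says can be configured identically.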

In one particular embodiment, each of the FIB/ILF memory, DSP macro block header/data memory, back P-vector output memory 1, and back P-vector output memory 2 is implemented as one of four sections of a dual port SRAM. Such a configuration conserves physical space and reduces power consumption. Further note that the dual channels for read and write simplify the architecture for direct mode motion vector prediction of B-type pictures. The macro block quantized discrete cosine transform (QDCT) coefficient memory can also be implemented using a dual port SRAM.

The state machine configured to carry out separation of coefficients and run level pairs (FIG. 5), the state machine configured to carry out run level pair decoding (FIG. 6), and the state machine configured to carry out transform, coefficient merging and motion vector prediction (FIG. 7) can be used, for example, to control the interface between the output stage of the pipeline architecture (S4 or S5, depending on whether in the Main Profile or the High Profile) and the output memory (FIB/ILF memory, DSP macro block header/data memory, back P-vector output memory 1, back P-vector output memory 2, and macro block quantized discrete cosine transform (QDCT) coefficient memory).

Motion Vector Prediction/Intra Mode Prediction Scheme

FIG. 1B illustrates a motion vector prediction and intra mode prediction scheme configured in accordance with one embodiment of the present invention. This scheme carries out the interface between stage S4 (and S5 if High Profile) and the line buffer of FIG. 1A. Note that stage S1 has already grouped the macro block properties of the picture into 4×4 sub blocks.

As can be seen, a portion of a frame is shown. The frame is divided into 4×4 sub blocks (16 pixels). These sub blocks are grouped together to form 8×8 blocks (64 pixels). These blocks are grouped together to form 16×16 macro blocks (256 pixels). As previously stated, the macro block properties are stored in the line buffer, which interfaces with stage S4 via the 80-bit bus. Each entry is associated with a horizontal reference and a vertical reference.
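The nesting of sub blocks, blocks, and macro blocks can be illustrated by mapping a pixel coordinate to the index of each unit that contains it. This is a minimal sketch for illustration; the function name and the returned dictionary keys are hypothetical.

```python
SUB_BLOCK = 4      # 4x4 sub block, 16 pixels
BLOCK = 8          # 8x8 block, 64 pixels
MACRO_BLOCK = 16   # 16x16 macro block, 256 pixels


def locate_pixel(x, y):
    """Return the (column, row) index of each unit containing pixel (x, y)."""
    return {
        "sub_block": (x // SUB_BLOCK, y // SUB_BLOCK),
        "block": (x // BLOCK, y // BLOCK),
        "macro_block": (x // MACRO_BLOCK, y // MACRO_BLOCK),
    }


def pixels_in(size):
    """Number of pixels in a square unit with the given side length."""
    return size * size


if __name__ == "__main__":
    print(locate_pixel(17, 5))
    print(pixels_in(SUB_BLOCK), pixels_in(BLOCK), pixels_in(MACRO_BLOCK))
```

For example, pixel (17, 5) falls in sub block column 4, row 1, in block column 2, row 0, and in macro block column 1, row 0.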

In this example, the horizontal reference is one row of sixteen pixels (where each of the four downward pointing arrows shown in FIG. 1B represents four pixels from the current row). Each row is stored in the line buffer, which in this case is a 960×80 single port SRAM. The vertical reference is provided by the four columns corresponding to the horizontal reference row. These macro block properties are stored in a vertical register (where the right and downward pointing arrow represents four pixels from one of the four current columns). This vertical register is updated for every 4×4 sub block for motion vector prediction for Main Profile and High Profile. For intra mode prediction, the register is updated for every 4×4 sub block for Main Profile, and for every 8×8 block for High Profile. This register is written out to another larger vertical register that collects and holds all properties for each macro block. This larger vertical register is updated for every macro block.

A macro block property shifter receives macro block property data from the line buffer, and is configured with three shift registers in this embodiment: a previous motion vector sample register, a current motion vector sample register, and a next motion vector sample register. Note that, while the motion vector macro block properties are being processed in these embodiments, the pixels associated with those motion vectors can also be processed with a similar structure, as discussed in the previously incorporated U.S. Ser. No. 11/137,971, filed May 25, 2005, titled “Digital Signal Processing Structure for Decoding Multiple Video Standards”, published as U.S. Patent Application Publication 2006/0126740A1 on Jun. 15, 2006.

The macro block property format stored in the macro block property shifter can be as follows: REF0, REF1, Frame/Field picture, Slice, Intra/Inter prediction, Forward/Backward prediction, MV0, MV1. Here, MV is motion vector, REF is reference picture ID, 0 (zero) is for forward and 1 (one) is for backward. Thus, MV0 is the forward motion vector, MV1 is the backward motion vector, REF0 is the forward reference picture ID, and REF1 is the backward reference picture ID. Numerous register formats can be used here, and the present invention is not intended to be limited to any one such format. Further detail of this shifter structure is discussed in the previously incorporated U.S. Ser. No. 11/137,971, filed May 25, 2005, titled “Digital Signal Processing Structure for Decoding Multiple Video Standards”, published as U.S. Patent Application Publication 2006/0126740A1 on Jun. 15, 2006.
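As a software analogy, the property format and the three-register shifter might be modeled as follows. This is a sketch under stated assumptions: the text lists the fields but not their widths or encodings, so the Python types, the class names, and the `shift` method are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MacroBlockProperty:
    ref0: int                # REF0: forward reference picture ID
    ref1: int                # REF1: backward reference picture ID
    is_field_picture: bool   # Frame/Field picture
    slice_id: int            # Slice
    is_intra: bool           # Intra/Inter prediction
    is_backward: bool        # Forward/Backward prediction
    mv0: Tuple[int, int]     # MV0: forward motion vector (x, y)
    mv1: Tuple[int, int]     # MV1: backward motion vector (x, y)


class PropertyShifter:
    """Three registers: previous, current, and next motion vector sample."""

    def __init__(self):
        self.previous: Optional[MacroBlockProperty] = None
        self.current: Optional[MacroBlockProperty] = None
        self.next: Optional[MacroBlockProperty] = None

    def shift(self, sample: MacroBlockProperty) -> None:
        """Advance by one sample: current -> previous, next -> current."""
        self.previous, self.current, self.next = self.current, self.next, sample
```

After three samples have been shifted in, the first sits in the previous register, the second in the current register, and the third in the next register.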

In addition, although a 4×4 inter prediction mode scheme is shown in FIG. 1B, an 8×8 inter prediction mode scheme and/or a 16×16 inter prediction mode scheme can also be implemented similarly here, as discussed in the previously incorporated U.S. Ser. No. 11/137,971, filed May 25, 2005, titled “Digital Signal Processing Structure for Decoding Multiple Video Standards”, published as U.S. Patent Application Publication 2006/0126740A1 on Jun. 15, 2006. This structure can also be extended to perform inter prediction of adaptive frame mode with double resources for 4×4, 8×8, and 16×16 macro block structures. In addition, the 4×4, 8×8, and 16×16 macro block structures are discussed separately, but the shared pipeline structure can process macro block structures in random order (e.g., 4×4, then 16×16, then 4×4, then 8×8, etc.).

ISE Memory Control State Machine

FIG. 2A illustrates an ISE memory control state machine configured in accordance with one embodiment of the present invention. The state machine remains in IDLE state during ISE memory underflow conditions, or in response to a reset, or during a decoding session. Otherwise, the next state is ISE_MBPROP. Here, the first syntax group is read. As previously explained, this syntax group has all the properties of the current macro block. This state continues until all bits of the syntax group string are read, as indicated by SynEosg (note that ˜SynEosg indicates more bits to read, while SynEosg indicates there are no more bits to read for the current syntax group).

The INTWT0 state provides a wait for the interrupt to finish. Once the interrupt is finished (as indicated by ˜Interrupt), the next state is NEWSLICE. This starts the decoding for a new slice. The next state CHKBPVC checks the back P-vector input memory 1 (read channel 1) and the back P-vector input memory 2 (read channel 2) for an underflow condition. If there is underflow and the current decoding picture is a B-type picture, then wait.

If there is no underflow (as indicated by ˜underflow), then start the macro block property checking. The CHKFLAG state checks the macro block properties for inter prediction, intra prediction, skip mode, and pulse code modulation (PCM) raw data mode. The PCMrawData state starts the PCM raw data state machine (FIG. 4). The MBSkipMode state starts the skip mode processing, and uses the residual decoder state machine (FIG. 2B). The MBIntraPred state starts the macro block intra prediction process and shares the residual decoder state machine (FIG. 2B). The MBInterPred state starts the macro block inter prediction process and also shares the residual decoder state machine (FIG. 2B).
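The branch taken out of the CHKFLAG state can be sketched as a simple dispatch on the macro block property flags. The state names come from the text; the function name, the flag parameters, the order in which the flags are tested, and the fallback of staying in CHKFLAG when no flag is set are assumptions made for illustration.

```python
def chkflag_next_state(pcm_raw=False, skip=False, intra=False, inter=False):
    """Choose the state entered from CHKFLAG for the current macro block.

    The test order below is an assumption; the text does not give a priority.
    """
    if pcm_raw:
        return "PCMrawData"    # starts the PCM raw data state machine (FIG. 4)
    if skip:
        return "MBSkipMode"    # uses the residual decoder state machine (FIG. 2B)
    if intra:
        return "MBIntraPred"   # shares the residual decoder state machine
    if inter:
        return "MBInterPred"   # shares the residual decoder state machine
    return "CHKFLAG"           # assumed: keep checking until a mode is flagged


if __name__ == "__main__":
    print(chkflag_next_state(intra=True))
```

Three of the four branches hand off to the same residual decoder state machine, which is what makes the pipeline shared across macro block types.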

The ISE memory control state machine generally operates as the main state machine that kicks off other state machines of the pipeline structure, as will be apparent in light of this disclosure. When the current stream is an H.264 stream and all the state machines are in IDLE state (e.g., when there is no H.264 data to process), a power saving mode can be implemented. In such a mode, non-active decoding logic can be shut down. Likewise, if the current stream is not an H.264 stream, then the whole pipeline architecture can be shut down. Various power consumption saving schemes can be used here.

Residual Decoder State Machine

FIG. 2B illustrates a residual decoder state machine configured in accordance with one embodiment of the present invention. Here, the machine remains in IDLE state if there are no intra and inter prediction starts (as indicated by INTERstart∥INTRAstart), and no skip start (as indicated by SKIPstart).

For the skip start mode, go to state CNT16. For intra and inter prediction, start with the CHK16NOY state. Here, check for the intra no luma (Y) case. If intra no luma, then go to state CNT16, where since there is no luma, there are a total of sixteen 4×4 sub blocks of zero. Else, go to the CHKISE state and start normal decoding.

From state CNT16, the CHKRESQY state checks to see if the input FIFO queue of stage S1 is full (as indicated by ResQfull). If the FIFO is full, then wait for the FIFO to be read. If the FIFO is not full, go back to state CNT16 and wait for the Res16zero signal, and then proceed to state CHKCHMA.

The CHKISE state determines if there is an ISE memory underflow (as indicated by IseMemUnderflow), and the process waits here until the ISE memory has enough macro block data. If not intra 16×16 mode (as indicated by MBInterPred∥MBIntra4×4), then the CHKISE state proceeds to the YBGN state to start normal AC coefficient decoding for luma until done, as indicated by YresDone (note that ˜YresDone indicates luma decoding is not done, while YresDone indicates luma decoding is done). Otherwise, go to the YDCSYN state, which is the luma DC coefficient decoding state if intra 16×16 mode. The CHKYAC state checks for luma AC coefficients once the data is received (as indicated by SynEosg). From the CHKYAC state, if there is no luma AC coefficient (as indicated by ˜MBIntra4×4 & IntraNOY), then go to the CNT16 state. Else, if there is a luma AC coefficient, go to the YBGN state, and start normal AC coefficient decoding for luma until done.

When the AC coefficient decoding for luma is done, the CHKCHMA state checks for chroma. If there is no chroma (as indicated by NoChroma∥MBSkipMode), then go to the CNT8 state to output eight 4×4 sub blocks of zero. Otherwise, go to UVBGN and start chroma decoding. From state CNT8, the CHKRESQUV state checks to see if the input FIFO queue of stage S1 is full (as indicated by ResQfull). If the FIFO is full, then wait for the FIFO to be read. If the FIFO is not full, go back to state CNT8 and wait for the Res8zero signal, and then proceed to state MBEND.

The UVBGN state starts chroma decoding until all chroma decoding is done (as indicated by UVresDone). The MBEND state issues a macro block end signal, and returns to the IDLE state so that a new macro block can be processed.
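The transitions described for FIG. 2B can be collected into a next-state function. This is a software sketch of the prose only, not the actual logic of the figure: the state and signal names follow the text (with MBIntra4×4 written as `MBIntra4x4`), while the function name, the dictionary of signals, and the UVBGN to MBEND transition on UVresDone are assumptions.

```python
def residual_next_state(state, signals):
    """Return the next residual decoder state given the asserted signals."""
    def on(name):
        return bool(signals.get(name, False))

    if state == "IDLE":
        if on("SKIPstart"):
            return "CNT16"
        if on("INTERstart") or on("INTRAstart"):
            return "CHK16NOY"
        return "IDLE"
    if state == "CHK16NOY":
        return "CNT16" if on("IntraNOY") else "CHKISE"
    if state == "CNT16":     # sixteen zero luma sub blocks
        return "CHKCHMA" if on("Res16zero") else "CHKRESQY"
    if state == "CHKRESQY":  # wait while the stage S1 input FIFO is full
        return "CHKRESQY" if on("ResQfull") else "CNT16"
    if state == "CHKISE":
        if on("IseMemUnderflow"):
            return "CHKISE"
        if on("MBInterPred") or on("MBIntra4x4"):
            return "YBGN"
        return "YDCSYN"      # intra 16x16: luma DC coefficients first
    if state == "YDCSYN":
        return "CHKYAC" if on("SynEosg") else "YDCSYN"
    if state == "CHKYAC":
        if not on("MBIntra4x4") and on("IntraNOY"):
            return "CNT16"
        return "YBGN"
    if state == "YBGN":
        return "CHKCHMA" if on("YresDone") else "YBGN"
    if state == "CHKCHMA":
        return "CNT8" if (on("NoChroma") or on("MBSkipMode")) else "UVBGN"
    if state == "CNT8":      # eight zero chroma sub blocks
        return "MBEND" if on("Res8zero") else "CHKRESQUV"
    if state == "CHKRESQUV":
        return "CHKRESQUV" if on("ResQfull") else "CNT8"
    if state == "UVBGN":     # assumed to end the macro block when done
        return "MBEND" if on("UVresDone") else "UVBGN"
    if state == "MBEND":
        return "IDLE"
    raise ValueError("unknown state: " + state)
```

With SKIPstart, MBSkipMode, Res16zero, and Res8zero asserted, stepping from IDLE visits CNT16, CHKCHMA, CNT8, MBEND, and then IDLE again, which corresponds to a skipped macro block producing only zero sub blocks.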

Back P-Vector Read/Write State Machines

FIG. 3A illustrates a back P-vector read state machine configured in accordance with one embodiment of the present invention. The machine remains in IDLE state as long as there is a back P-vector input memory underflow (as indicated by BackPvectUnderflow).

The WRTOP state reads the reference data from channel 1. This is mainly for even macro block row or top field reads. The WAIT state provides delay between the channel 1 and channel 2 reads. In addition, the WRTOP state indicates that the top macro block read is done and only one macro block was read. In this case, go to the RDEND state to start searching logic ID (as indicated by TopMBDone & OneMBonly).

The WRBOT state reads the reference data from channel 2. This is mainly for odd macro block row or bottom field reads. Completion of the bottom macro block read is indicated by BotMBrdDone. Otherwise, the process waits at the WRBOT state.

The RDEND state receives indications from the WRTOP and WRBOT states that the respective reads are done, and is the macro block read end. If in spatial direct mode or the current macro block is a PCM raw data macro block (as indicated by Spatial Dir∥PCMrawData), then go back to the IDLE state. No logic ID searching is performed here.

Else, the CHKDM state checks for the temporal direct mode. If the current block is temporal direct mode (as indicated by CurrBlkTemDir), then the SCHID state starts searching logic ID until all four logic IDs for all four blocks are found. The search is carried out on a block by block basis. If the current block is not temporal direct mode, then go to the INC state and skip to the next block. The INC state increments to the next block.

The SCHID state searches the ID state until a match is found (as indicated by IDMatch). Note that it may take 1 to 32 cycles to find one logic ID match. Upon finding all the logic IDs (as indicated by IDMatch & PhyIDCntr3), the flow returns to the IDLE state to wait for a new task.
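The 1 to 32 cycle bound is consistent with a sequential search that makes one comparison per cycle over at most 32 candidates. The sketch below illustrates that reading only; the table layout, the one-comparison-per-cycle model, and all names are assumptions, since the text states just the cycle range.

```python
MAX_ENTRIES = 32  # upper bound on search cycles stated in the text


def search_logic_id(physical_id_table, target_physical_id):
    """Scan the table one entry per cycle until the target ID matches.

    Returns (logic_id, cycles), where logic_id is the matching index, or
    (None, cycles) if no entry matched within MAX_ENTRIES.
    """
    cycles = 0
    for logic_id, physical_id in enumerate(physical_id_table[:MAX_ENTRIES]):
        cycles += 1
        if physical_id == target_physical_id:   # IDMatch
            return logic_id, cycles
    return None, cycles


def search_all_blocks(physical_id_table, block_physical_ids):
    """Repeat the search block by block, as done for the four blocks."""
    return [search_logic_id(physical_id_table, pid) for pid in block_physical_ids]
```

Under this model a match at the first entry costs one cycle and a match at the last of 32 entries costs 32 cycles.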

FIG. 3B illustrates a back P-vector write state machine configured in accordance with one embodiment of the present invention. As can be seen, the write state machine is relatively simple compared to the read state machine. If the back P-vector write channels 1 and 2 are not overflowed (as indicated by bpvoOverflow) and the current macro block properties to the FIB/ILF module have been saved and the current picture is a reference picture, then go to the WRT state. Otherwise, stay at the IDLE state. At the WRT state, the P-vector write is continued until MBDone is asserted. Then, go to IDLE.
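Because the write state machine has only the IDLE and WRT states, its transitions fit in a few lines. The following sketch mirrors the conditions given above; the function and parameter names are hypothetical.

```python
def bpv_write_next_state(state, overflow, props_saved, is_reference, mb_done):
    """Next state of the back P-vector write state machine (IDLE or WRT)."""
    if state == "IDLE":
        # Start writing only when the write channels have not overflowed, the
        # macro block properties to the FIB/ILF module have been saved, and
        # the current picture is a reference picture.
        if not overflow and props_saved and is_reference:
            return "WRT"
        return "IDLE"
    if state == "WRT":
        # Keep writing until MBDone is asserted.
        return "IDLE" if mb_done else "WRT"
    raise ValueError("unknown state: " + state)
```

A non-reference picture never leaves IDLE, so its macro block properties are not written out through the two write channels.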

PCM Raw Data State Machine

FIG. 4 illustrates a PCM raw data state machine configured in accordance with one embodiment of the present invention. The state machine remains in IDLE state during non PCM raw data modes, or in response to a reset. Otherwise, the next state is GETEOS1, which gets the first syntax group.

The ISEWT-ISEWT2 states wait to check the ISE memory underflow condition (as indicated by IseMemUnderFlow). When there is no ISE memory underflow condition, the ISERD state reads data from the ISE memory until the end of the current syntax group (as indicated by SynEosg). The LDADR state and the LDWT state both enable the macro block header write during the PCM raw data mode. The LDWT state checks for an overflow condition of output memory (as indicated by QdctOverflow∥MbbSmphOverflow).

The LDRAW state loads raw data for two blocks, and the flow continues to the CHK192 state. In this particular embodiment, the raw scan mode is not 4×4 sub block format. It is raster scan with sixteen pixels per line, with two blocks being a 64 count and six blocks being 192 counts. Thus, the flow from the ISEWT state to the CHK192 state is repeated until 192 counts are reached (as indicated by Syntax192).

After 192 counts are reached, start the next macro block. The MBEND state issues the end of macro block in response to the Syntax192 condition being satisfied, and the PCMCNT16 state counts sixteen sub blocks (as indicated by PcmCnt16), and then ends with the IDLE state.
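The count figures for the PCM raw data mode can be cross-checked with simple arithmetic. This sketch assumes that a "block" here is an 8×8 block of 64 pixels, as defined for FIG. 1B; the conclusion that each count then carries two pixels is an inference from the stated numbers, not something the text says. All names are hypothetical.

```python
PIXELS_PER_BLOCK = 64   # assumed: one 8x8 block
COUNTS_PER_BLOCK = 32   # from "two blocks being a 64 count"


def counts_for_blocks(blocks):
    """Number of counts needed to load the given number of blocks."""
    return blocks * COUNTS_PER_BLOCK


def pixels_per_count():
    """Pixels carried by one count under the 8x8 block assumption."""
    return PIXELS_PER_BLOCK // COUNTS_PER_BLOCK


def ldraw_passes(total_blocks=6, blocks_per_pass=2):
    """How many times the LDRAW state runs if it loads two blocks per pass."""
    return total_blocks // blocks_per_pass


if __name__ == "__main__":
    print(counts_for_blocks(2), counts_for_blocks(6))
    print(pixels_per_count(), ldraw_passes())
```

Six blocks give 192 counts, matching the Syntax192 condition, and at two blocks per LDRAW pass the loop from ISEWT to CHK192 would run three times per macro block.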

Separation of Coefficients and Run Level Pair State Machine

FIG. 5 illustrates a syntax group math state machine configured to carry out separation of coefficients and run level pairs in accordance with one embodiment of the present invention. Here, the machine remains in IDLE state if the input FIFO queue of stage S1 is empty (as indicated by ResQempty).

The LDSYN state loads the syntax group if the input FIFO queue is not empty. The CHKOSUB state checks whether the current 4×4 sub block is a zero sub block. If the current sub block is all zero or all coefficients (no zeros), as indicated by CurSubZero∥ZeroCoeff∥AllCoeff, then go to the CHKFULL state. Else, the MATHOP state loads levels into the registers until there are no more entries in the current sub block, as indicated by MathOPDone0. The MathOPDone0 signal is controlled by the number of nonzero level coefficients (note that ˜MathOPDone0 indicates that there are more entries in the current sub block).

When there are no more entries in the current sub block, the CHKFULL state checks the next stage FIFO for a full condition. If it is full (as indicated by Runb4Full), wait here until it is not full. If the next stage FIFO is not full, then the LDPROP state writes to the next stage FIFO.

Run Level Pair Decoding State Machine

FIG. 6 illustrates a syntax group math state machine configured to carry out run level pair decoding in accordance with one embodiment of the present invention. The machine remains in IDLE state if the previous stage FIFO is empty (as indicated by Runb4empty). Else, the LDSYN state loads the current register set with the previous stage FIFO if it is not empty.

The CHK0SUB state determines if the loaded data is a zero sub block (as indicated by P1SubZero∥P1AllCoeff). If so, then go to CHKFULL to determine if the QDCT data FIFO is full. Else, if there is a trailing one, then go to the LD1ST state to start the trailing one decoding. Else, if CABAC mode, then go to the CABAC state. Otherwise, go to the MATHOP state for normal run level decoding.
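The chain of "if, else if, otherwise" conditions evaluated in the CHK0SUB state amounts to a priority dispatch, sketched below. The state names and the order of the checks follow the text; the function name and the boolean parameters standing in for the hardware signals are hypothetical.

```python
def chk0sub_next_state(p1_sub_zero, p1_all_coeff, has_trailing_one, cabac_mode):
    """Pick the state entered from CHK0SUB, testing conditions in text order."""
    if p1_sub_zero or p1_all_coeff:   # P1SubZero || P1AllCoeff
        return "CHKFULL"
    if has_trailing_one:
        return "LD1ST"                # start trailing one decoding
    if cabac_mode:
        return "CABAC"
    return "MATHOP"                   # normal run level decoding


if __name__ == "__main__":
    print(chk0sub_next_state(False, False, True, True))
```

Because the trailing one check comes before the CABAC check in the text, a sub block flagged with both goes to LD1ST in this sketch.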

The LD1ST state searches for the first trailing one and checks for a second trailing one. If there is a second trailing one, then go to the LD2ND state. Else, if only one trailing one exists (as indicated by OnlyTrlOne && YUV1TrlOne), then go to the CHKFULL state to check to see if the QDCT data FIFO is full. Else, go to the MATHOP state for normal decoding flow (as indicated by YUV1TrlOne).

The LD2ND state searches for the second trailing one and checks for a third trailing one. If there is a third trailing one, then go to the LD3RD state. Else, if only two trailing ones exist (as indicated by OnlyTrlOne && YUV2TrlOne), then go to the CHKFULL state to check to see if the QDCT data FIFO is full. Else, go to the MATHOP state for normal decoding flow (as indicated by YUV2TrlOne).

The LD3RD state searches for the third trailing one. If only trailing ones exist (as indicated by OnlyTrlOne), then go to the CHKFULL state to check to see if the QDCT data FIFO is full. Else, go to the MATHOP state for normal decoding flow.

The MATHOP state carries out normal run level decoding flow. The CHKFULL state determines whether the QDCT data FIFO is full. If it is not full (as indicated by QdctDFull), then go to the LDPROP state, which will write the current sub block into the QDCT data FIFO. Else, wait at the CHKFULL state until the QDCT data FIFO is not full.

The CABAC state carries out CABAC decoding, and continues until that decoding is done, as indicated by CABACDone (where ˜CABACDone indicates that the CABAC decoding is not done, and CABACDone indicates that the CABAC decoding is done). When CABAC decoding is done, then go to the LDWAIT state, and then continue to the CHKFULL state.
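For readers unfamiliar with run level coding, the general idea behind the run level decoding flow is that each pair gives a count of zero coefficients followed by one nonzero level. The sketch below expands such pairs into the sixteen coefficients of a 4×4 sub block in scan order. It is a generic illustration, not the exact H.264 procedure or the MATHOP logic of this embodiment; the function name and the pair convention are assumptions.

```python
def expand_run_level(pairs, size=16):
    """Expand (run, level) pairs into a list of `size` coefficients.

    Each pair means: skip `run` zero coefficients, then place `level`.
    Positions not covered by any pair remain zero.
    """
    coefficients = [0] * size
    position = 0
    for run, level in pairs:
        position += run
        if position >= size:
            raise ValueError("run level pairs exceed the sub block size")
        coefficients[position] = level
        position += 1
    return coefficients


if __name__ == "__main__":
    print(expand_run_level([(0, 5), (2, -1), (1, 3)]))
```

With the pairs (0, 5), (2, -1), (1, 3), the levels land at positions 0, 3, and 5, and the other thirteen coefficients stay zero.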

Transform, Coefficient Merging, and Motion Vector Prediction State Machine

FIG. 7 illustrates a state machine configured to carry out transform, coefficient merging and motion vector prediction in accordance with one embodiment of the present invention. The machine remains in IDLE state as long as the output memory is full or overflowed or the QDCT data FIFO is empty (as indicated by QdctSmphOverflow∥mbbSmphOverflow∥QdctDEmpty). Else, the LDSYN state loads the current sub block QDCT data into the output memory. Note that this involves reading the QDCT data FIFO. During the same period that the current sub block is decoded, the motion vector prediction for the corresponding 4×4 sub block is computed.

The WRCEL0 state writes the first half of the current sub block, and the WRCEL1 state writes the second half of the current sub block. The CHKEND state checks for the end of the current macro block. If not the end of macro block (as indicated by Subblk23), then the INCSUB state continues the operation until the end of macro block (24 sub blocks) is reached.

The CHKEMP state determines if the QDCT data FIFO is empty. If it is empty (as indicated by QdctDEmpty), then wait at the CHKEMP state. Else, go to the LDSYN state and start the next sub block. Note that additional cycles can be added between the INCSUB state and the CHKEMP state for Y processing (as compared to UV processing), so the motion vector computation has more time.
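The per-macro-block write loop can be sketched as a sequence of half sub block writes. The state names and the count of 24 sub blocks come from the text; the function name is hypothetical, and the observation that 24 sub blocks times two halves equals the 48 coefficient data beats mentioned earlier rests on the assumption that each half write occupies one 128-bit beat.

```python
SUB_BLOCKS_PER_MACRO_BLOCK = 24  # sub blocks 0..23, ending at Subblk23


def coefficient_write_sequence(sub_blocks=SUB_BLOCKS_PER_MACRO_BLOCK):
    """List the (state, sub_block) writes issued for one macro block."""
    sequence = []
    for sub_block in range(sub_blocks):
        sequence.append(("WRCEL0", sub_block))  # first half of the sub block
        sequence.append(("WRCEL1", sub_block))  # second half of the sub block
    return sequence


if __name__ == "__main__":
    writes = coefficient_write_sequence()
    print(len(writes), writes[0], writes[-1])
```

The sequence contains 48 writes, starting with the first half of sub block 0 and ending with the second half of sub block 23.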

The WRMBH0 state writes the first macro block header to the in-line loop filter (ILF) and fractional interpolation block (FIB). The WRMBH1 state writes the second macro block header to the ILF and FIB. These are the first two entries of the data element to other blocks (FIB and ILF). The remaining 16 data beats are for motion vector writes, which are executed in the next state: QDCT_MV16. The QDCT_MV16 state writes all the motion vectors to output memory until the motion vector writes are complete (as indicated by Mbk15saved). Note that there are a total of eighteen data beats for luma and chroma for this embodiment.

The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of this disclosure. For instance, the bus sizes (e.g., 128-bit, 24-bit, and 80-bit) between the memories and the macro block builder can be varied, depending on the particular application at hand. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

1. A video decoding system configured with a multi-stage shared pipeline architecture for carrying out the H.264 CABAC and CAVLC entropy decoding processes in Main Profile and High Profile, comprising: a first stage for grouping macro block properties into 4×4 sub blocks; a second stage for performing separation of coefficients and run level pairs; a third stage for performing run level pair decoding; a fourth stage for performing 4×4 block zigzag transform, DC/AC coefficient merging, and motion vector prediction; and a fifth stage for performing 8×8 transforms in the High Profile, and is skipped in the Main Profile.
2. The system of claim 1 wherein the fourth stage has four split modes for the motion vector prediction: 16×16, 16×8, 8×16, and 8×8, and each split mode has its own motion vector computation logic, reference picture index computation, and frame field scaling.
3. The system of claim 1 wherein B-type picture decoding has a dual channel read write structure, and logic index ID searching is performed on a block by block basis.
4. The system of claim 3 wherein a block N+1 logic index ID is calculated while motion vector prediction is performed for block N, thereby mitigating searching time.
5. The system of claim 1 further comprising a memory interfaced to the first stage via an N-bit bus, wherein there are a plurality of syntax groups of N-bit data inside each macro block, and the memory provides the first stage with the length for every syntax group.
6. The system of claim 5 wherein the first stage includes a memory control state machine that collects information about macro block properties and groups it into 4×4 sub blocks for all predictions.
7. The system of claim 5 wherein N=24, and the N-bits designate current macro block properties including at least one of: intra prediction, inter prediction, skip mode, raw data mode, macro block ID, slice ID, direct mode, macro block split mode, sub block split mode which can be down to 4×4 for luma and 2×2 for chroma and all corresponding intra prediction flag and motion vector differences on X and Y direction, picture reference index for forward and backward predictions, CABAC bit map and level, and CAVLC run level information.
8. The system of claim 1 wherein the first stage includes a residual decoder state machine for carrying out residual decoding of I-type, P-type, and B-type pictures.
9. The system of claim 1 wherein the first stage includes a PCM raw data state machine for carrying out a PCM raw data mode.
10. The system of claim 1 further comprising a dual channel read memory interfaced to the first stage via an N-bit bus, wherein for frame pictures, channel 1 of the memory is for even macro block row reference reads and channel 2 of the memory is for odd macro block reference read, and for field pictures, channel 1 of the memory is for a top field picture and channel 2 of the memory is for a bottom field picture reference read.
11. The system of claim 1 further comprising a dual channel read memory interfaced to the first stage via an N-bit bus, wherein if a current decoding picture is a frame picture and a corresponding reference picture is a field picture, either top field or bottom field, then both channel 1 and channel 2 of the memory have the same reference picture, thereby facilitating B-type picture decoding.
12. The system of claim 11 wherein if current decoding picture is a B-type picture, then the reference picture is read from a DDR SDRAM for direct mode motion vector prediction.
13. The system of claim 1 further comprising a dual channel read memory interfaced to the first stage via a 128-bit bus, and for every macro block there are five data beats of 128 bits that contain macro block properties, logical ID and physical ID of every 8×8 block inside a 16×16 macro block, and motion vectors of every 4×4 block.
14. The system of claim 1 further comprising a line buffer used for motion vector prediction of inter prediction and intra prediction mode decoding.
15. The system of claim 1 further comprising a dual channel write memory interfaced to the fifth stage via an N-bit bus, wherein if a current decoding picture is a field picture, either top field or bottom field, then the top field is written out through channel 1 and the bottom field is written out through channel 2 of the memory, thereby facilitating B-type picture decoding.
16. The system of claim 1 further comprising a dual channel write memory interfaced to the fifth stage via an N-bit bus, wherein if a current decoding picture is a reference picture, then properties for each macro block are saved to a DDR SDRAM and read back when current decoding picture is a B-type picture.
17-18. (canceled)
19. The system of claim 1 further comprising: a macro block quantized DCT coefficient memory interfaced to the fifth stage via a 128-bit bus, for storing dequantized DC/AC coefficients.
20. The system of claim 1 where the second stage includes a state machine for carrying out separation of coefficients and run level pairs.
21. The system of claim 1 where the third stage includes a state machine for carrying out run level pair decoding.
22. The system of claim 1 where the fourth and fifth stages include a state machine for carrying out transform, DC/AC coefficient merging, and motion vector prediction.
 23. (canceled)
24. A video decoding system configured with a multistage shared pipeline architecture for carrying out the H.264 CABAC and CAVLC entropy decoding processes in Main Profile and High Profile, comprising: a first stage including a memory control state machine that collects information about macro block properties and groups it into 4×4 blocks for all predictions; a second stage for performing separation of coefficients and run level pairs; a third stage for performing run level pair decoding; a fourth stage for performing 4×4 block zigzag transform, DC/AC coefficient merging, and motion vector prediction, wherein the fourth stage has four split modes for the motion vector prediction: 16×16, 16×8, 8×16, and 8×8, and each split mode has its own motion vector computation logic, reference picture index computation, and frame field scaling; a fifth stage for performing 8×8 transforms in the High Profile, and is skipped in the Main Profile; and dual read channels and dual write channels for direct mode motion vector prediction of B-type pictures.
 25. (canceled)